A Brief Introduction to Supervised Learning
(Figure source: https://www.mathworks.com/discovery/reinforcement-learning.html)
- Unsupervised Learning: learn with unlabeled data
- Supervised Learning: learn with labeled data
- Reinforcement Learning: learn through reward maximization
We will focus on two subcategories of supervised learning: classification and regression
Data Structure
- Labels are categorical for classification
- Labels are continuous for regression
- Example “Pima Indian Diabetes”: label “pos” vs. “neg”
| diabetes | age | mass | pressure | pregnant |
|----------|-----|------|----------|----------|
| pos      | 50  | 33.6 | 72       | 6        |
| neg      | 31  | 26.6 | 66       | 1        |
| pos      | 32  | 23.3 | 64       | 8        |
| neg      | 21  | 28.1 | 66       | 1        |
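The four rows above can be represented as a feature matrix and a label vector; a minimal sketch in NumPy (the variable names `X` and `y` are conventional, not from the original slides):

```python
import numpy as np

# The four Pima rows above, with features in column order:
# age, mass, pressure, pregnant
X = np.array([
    [50, 33.6, 72, 6],
    [31, 26.6, 66, 1],
    [32, 23.3, 64, 8],
    [21, 28.1, 66, 1],
])
# Categorical labels -> a classification task
y = np.array(["pos", "neg", "pos", "neg"])

print(X.shape)  # (4, 4): n = 4 observations, p = 4 features
```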
Classification
- Goal: predict label for new observations accurately based only on their features
- In other words: find a good decision rule to discriminate between label categories
- Here: a simple linear decision rule. Predictions for new observations are determined by which side of the (hyper)plane they fall on
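A linear decision rule can be sketched in a few lines: predict by which side of the hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\) an observation lies on. The weights below are made up for illustration; in practice they are learned from training data:

```python
import numpy as np

def linear_classify(X, w, b):
    """Predict 'pos' if an observation lies on the positive side of the
    hyperplane w.x + b = 0, else 'neg'."""
    scores = X @ w + b
    return np.where(scores > 0, "pos", "neg")

# Hypothetical weights (not learned here, just for illustration)
w = np.array([1.0, -1.0])
b = 0.0

X_new = np.array([[2.0, 1.0],   # score  1.0 -> 'pos'
                  [0.5, 3.0]])  # score -2.5 -> 'neg'
print(linear_classify(X_new, w, b))
```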
Regression
- Goal: predict response for new observations based on their features
- In other words: find a good function approximation
- Here: simple polynomial function
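Fitting a simple polynomial as a function approximation can be sketched with NumPy's least-squares polynomial fit; the toy data below (a noisy cubic) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
# Noisy observations of the true relationship y = 2x^3 - x
y = 2 * x**3 - x + rng.normal(scale=0.1, size=x.size)

coeffs = np.polyfit(x, y, deg=3)  # least-squares fit of a degree-3 polynomial
f = np.poly1d(coeffs)             # the fitted function approximation

y_hat = f(0.5)  # prediction for a new observation; true value is -0.25
```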
Terminology
\(\mathcal{D}\) is data set of \(n\) observations \(\left( \mathbf{x}^{(i)}, y^{(i)} \right)\), \(i = 1, \ldots, n\)
- Feature vector \(\mathbf{x}^{(i)} \in \mathcal{X}^p\), e.g. \(\mathcal{X}^p \equiv \mathbb{R}^p\)
- Label \(y^{(i)} \in \mathcal{Y}\), here \(|\mathcal{Y}| = g\) is number of label categories and \(g \equiv 1\) for regression
Terminology
A function which assigns a prediction to a feature vector is called ML model: \[f: \mathcal{X} \to \mathbb{R}^g\] \[f \left(\mathbf{x}^{(i)} \right) =: \hat{y}^{(i)}\]
- \(f\) yields a probability (or score) for each of the \(g\) categories
- What are desired properties of \(f\)?
- How to find a “good” ML model \(f\)?
Terminology
Approximate the generalization error by the empirical risk and minimize it (empirical risk minimization): \[\mathcal{R}_{\text{emp}}(f) := \frac{1}{n} \sum_{i=1}^n L \left( y^{(i)}, f(\mathbf{x}^{(i)}) \right) \]
- \(L\) is a loss function with \(L: \mathcal{Y} \times \mathbb{R}^g \to \mathbb{R}\)
- Popular choices for \(L\):
- 0/1 loss (misclassification): \(L(y, \hat{y}) = \mathcal{I}(y \neq \hat{y})\)
- Squared error: \(L(y, \hat{y}) = \left( y - \hat{y} \right)^2\)
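The empirical risk is just the average loss over the data; a minimal sketch (the helper name `emp_risk` and the toy labels are made up for illustration):

```python
import numpy as np

def emp_risk(y, y_hat, loss):
    """Empirical risk: average loss over the n observations."""
    return np.mean([loss(t, p) for t, p in zip(y, y_hat)])

zero_one = lambda t, p: float(t != p)  # misclassification (0/1) loss
squared  = lambda t, p: (t - p) ** 2   # squared-error loss

y_cls    = ["pos", "neg", "pos"]
yhat_cls = ["pos", "pos", "pos"]
print(emp_risk(y_cls, yhat_cls, zero_one))  # 1 of 3 misclassified -> 1/3

y_reg, yhat_reg = [1.0, 2.0], [1.5, 2.0]
print(emp_risk(y_reg, yhat_reg, squared))   # (0.25 + 0) / 2 = 0.125
```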
Underfitting
The function \(f\) is not capable of capturing the true relationship between features and labels.
Underfitting can be detected by poor model performance on both training and test data.
Overfitting
The function \(f\) starts modeling the noise.
Overfitting can be detected by poor model performance on test data only. On training data, the performance can be made arbitrarily good by simply memorizing the observations.
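Both failure modes can be illustrated by fitting polynomials of different degrees to noisy quadratic data; the data set and degrees below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x_tr = np.linspace(-1, 1, 15)
y_tr = x_tr**2 + rng.normal(scale=0.2, size=x_tr.size)   # training data
x_te = np.linspace(-1, 1, 100)
y_te = x_te**2 + rng.normal(scale=0.2, size=x_te.size)   # independent test data

train_err, test_err = {}, {}
for deg in (1, 2, 9):
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    train_err[deg] = np.mean((f(x_tr) - y_tr) ** 2)
    test_err[deg]  = np.mean((f(x_te) - y_te) ** 2)

# deg=1 underfits: both training and test error are high.
# deg=9 drives the training error toward zero but typically
# increases the test error -- the polynomial models the noise.
```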
Evaluation
Important: You must evaluate on new, unseen observations!
Evaluation on independent test data
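A simple hold-out split can be sketched as follows; the 75/25 ratio and the random data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 3))        # toy features
y = rng.integers(0, 2, size=n)     # toy labels

# Hold out 25% of the observations as an independent test set
perm = rng.permutation(n)
test_idx, train_idx = perm[:n // 4], perm[n // 4:]
X_train, y_train = X[train_idx], y[train_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]
# Fit the model on (X_train, y_train) only;
# report performance on the unseen (X_test, y_test)
```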
Model Selection
The choice of the model depends on:
- (1) generalization performance
- (2) runtime and available hardware
- (3) experience and familiarity of the machine learner
- (4) interpretability of the resulting model
- (5) other constraints of the application

In data mining competitions, we ultimately optimize for (1), but all other points are important to consider
Model Selection
Suggested procedure:
- Start with some baselines, e.g., constantly predicting the majority class
- Compare to simple and interpretable ML models like logistic regression, classification trees, or handcrafted rules
- Analyze the models, try to improve them via feature engineering
- Compare to more complex models (with engineered features) and start hyperparameter tuning
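The first step above, a majority-class baseline, can be sketched in a few lines (the helper name `majority_baseline` is made up for illustration):

```python
from collections import Counter

def majority_baseline(y_train):
    """Return a predictor that constantly outputs the most frequent
    training label -- the simplest baseline any real model must beat."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda x: majority

predict = majority_baseline(["neg", "neg", "pos", "neg"])
print(predict([50, 33.6, 72, 6]))  # 'neg', regardless of the features
```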
Know your Models
- The “no free lunch” theorem holds: there is no single model which works well on all data sets.
- You usually compare multiple competitors and pick the “best”.
- It is important to understand the strengths of each model and how to work around its weaknesses.
- Example: Circle data set and linear decision rule
Circle Example
Define \(x_3 := (x_1^2 + x_2^2)\)
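With the engineered feature \(x_3\), the circle data becomes separable by a single threshold, i.e., by a linear rule. A minimal sketch on synthetic circle data (radii and threshold chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic circle data: inner class within radius 1, outer class beyond radius 2
r = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
theta = rng.uniform(0, 2 * np.pi, 100)
x1, x2 = r * np.cos(theta), r * np.sin(theta)
y = np.array([0] * 50 + [1] * 50)

# No line in (x1, x2) separates the classes, but the engineered
# feature x3 = x1^2 + x2^2 does with a single threshold
x3 = x1**2 + x2**2
pred = (x3 > 1.5**2).astype(int)  # a linear rule in the new feature

accuracy = (pred == y).mean()
print(accuracy)  # 1.0 -- perfectly separable after the transform
```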